In search of a perfect trait set: A workflow presentation based on the conservation status assessment of Poland's dendroflora

Abstract

Considering the dynamically changing environment, we cannot be sure whether we are using the best possible plant functional traits to explain ecological mechanisms. We provide a quantitative comparison of 13 trait sets to determine the availability of functional traits representing different plant organs, assess the trait sets with the highest explanatory potential, and check whether including a higher number of traits in a model increases its accuracy. We evaluated the trait sets by preparing 13 models using the same methodology and answering one research question: How do models with different sets of functional traits predict the conservation status of species? We used a dataset covering all woody species from Poland (N = 387), with 23 functional traits. Our findings indicate that what matters most for a trait set of high explanatory power is the precise selection of those traits. The best-fitting model was based on the findings of Díaz et al. (2016; The global spectrum of plant form and function, Nature, 529, 167‐171) and included only six traits. Importantly, traits representing different plant organs should be included whenever possible: Three of the four best models from our comparison were the ones that included traits of various plant organs.

While trait databases are becoming increasingly comprehensive and accessible, selecting the most suitable functional traits for analysis remains an uncertain matter (Lefcheck et al., 2015; Mlambo, 2014; Rosado et al., 2013). Previous studies provided instructions on the best possible choice of traits, for example, by using "all traits that are important for the function of interest" (Petchey & Gaston, 2006). Expert recommendations are also available for more specific topics, such as the use of traits for climate change analyses (Green et al., 2022; Kühn et al., 2021). The authors of the recently published "Handbook of Trait-Based Ecology" recommend studying only those traits for which there is a clear and specific hypothesis on how the given trait affects the process studied (effect traits) or on how the trait is affected by the process studied (response traits). They encourage researchers to use numerous traits during the exploratory phase of a study, to find out which traits work for a particular research question, rather than in the final phase of analysis, where (in most cases) the number of traits used should be reduced. This point of view suggests that the more specific the research question, the easier the selection of traits. We found particularly interesting the suggestion to use numerous traits at the initial stage of data analysis to explore which ones best suit the research aim. Yet limitations such as the rather narrow representation of species measurements in databases (approximately 17% worldwide; Cornwell et al., 2019), uneven geographic coverage (biased toward the Global North; Kleyer et al., 2008; Myers et al., 2000; Perez et al., 2019; Tavşanoğlu & Pausas, 2018), and biased coverage of plant organs (toward leaf traits; Cornwell et al., 2019; Kühn et al., 2021) often hinder the use of less popular traits.
This way, some valuable traits may be omitted because of their limited availability or their relatively low level of previous usage documented in the scientific literature. Therefore, it is not surprising that numerous studies simply adopt the sets of functional traits proposed in classic papers from the field (Díaz et al., 2016; Westoby, 1998; Wright et al., 2004), as this methodology has already been tried by numerous researchers. This often makes it a default approach (Coleman et al., 2020; Finegan et al., 2015; Hoffmann et al., 2005). However, every research question is different and covers at least a different group of species and a different spatiotemporal scale. Taking into account the suggestions of de , which concern careful, justified selection of traits for each study question, we may assume that for many research questions those "default" trait sets are not the best-fitting ones for explaining particular ecological mechanisms. At the same time, we recognize that the quantitative comparison of different trait sets can be problematic, especially for scientists who do not specialize in functional ecology and use functional traits as a tool to solve varying research problems. Therefore, we decided to quantitatively compare the usefulness of different trait sets and to present an example workflow that we hope will be useful to other researchers exploring their own datasets.
We assumed that a good example dataset would be a database representing a large number of species with a high coverage of measurements of different traits. We then searched for a general research question that would not restrict the analysis to a strictly limited set of traits. Furthermore, when focusing on a more general research question, the inclusion of traits representing diverse plant organs is particularly important. Here, we provide a quantitative comparison of the usefulness of different trait sets for predicting species conservation status, as whether a species is threatened is determined by a complex set of factors that differs among species.
So far, quantitative comparisons of different trait sets have been lacking. As mentioned above, numerous papers have presented optimal trait sets that are widely used in ecological research (Díaz et al., 2016; Pierce et al., 2013; Westoby, 1998; Wright et al., 2004), but their results have never been compared. That is why we decided to study one of the crucial aspects of plant ecology in the context of a changing climate, and to assess which traits work best for predicting which species could become threatened. As a proxy of species threat, we used the conservation status of each species derived from the Red List and the Red Book of Ferns and Vascular Plants in Poland. Due to the low level of data coverage for herbaceous plants (Paź-Dyderska et al., 2020), we decided to focus on woody species. Here, we provide a quantified comparison based on an extensive regional trait database, including all the woody species occurring in Poland (N = 387) together with 23 functional traits, both numerical and categorical, selected based on their availability in open databases. We evaluated the traits by preparing 13 different models based on previously conducted studies and answering one research question: How do models with different sets of functional traits predict the conservation status of different species? This was inspired by Miles (2020), who found that morphological traits are a highly explanatory proxy for assessing the risk of extinction of iguanian lizards, even for data-deficient or poorly known species. For plants, size- and reproduction-related traits also have high explanatory potential when predicting the probability of being threatened. For example, the recent study by Carmona et al. (2021) revealed that woody species (larger and with slowly adapting reproductive processes) have a three times higher probability of being threatened than herbaceous species (smaller and with more dynamic reproductive strategies). That is why, in this approach, we assumed that traits may explain the conservation status of the species.
In this study, we aimed to (1) determine the availability of functional traits representing different plant organs, (2) assess the trait sets with the highest explanatory potential, and (3) check whether model accuracy is enhanced by including a high number of traits. We hypothesized that (1) leaf traits are best represented both in terms of data availability and traits present in different models (Guerrero-Ramírez et al., 2021; McCormack et al., 2017; Perez et al., 2019), (2) the trait sets with the highest explanatory potential will be those including traits evenly distributed across different plant organs (Kleyer & Minden, 2015), and (3) including more traits will not lead to better model performance (Lefcheck et al., 2015).

| Data collection
We prepared a list of 387 woody species occurring in Poland. To include the whole dendroflora, accounting for trees, shrubs, and subshrubs (i.e., species characterized by the lignification of only the lower parts of the shoots, species only weakly woody, or species that are woody but survive only a few growing seasons), we compiled data from numerous sources (CABI, 2020; Kaźmierczakowa et al., 2014, 2016; Rutkowski, 2006; Wild et al., 2019; Zieliński, 1987, 2004). We included both native and non-native species. Subsequently, we verified all the species names using the Global Biodiversity Information Facility (2020) and The Plant List (2020). After assessing the conservation status of the species studied using the Red List and the Red Book of Ferns and Vascular Plants in Poland (Kaźmierczakowa et al., 2014, 2016), we considered 36 species (i.e., 9.3% of all the species studied) to be threatened. Then, we compiled trait data (Tables 1 and 2) by joining TRY and BIEN data (Kattge et al., 2020; Maitner et al., 2018). We downloaded data for all traits that were available for our species. Then, we filled in as much missing data as possible by searching for information in botanical handbooks, identification keys, and red lists of threatened species (CABI, 2020; Kaźmierczakowa et al., 2014, 2016; Rutkowski, 2006; Wild et al., 2019; Zieliński, 1987, 2004). This was mainly information on flowering onset, dispersal factors, leaf life span, nectar accessibility, and nectar amount. After completing the database, we decided to continue the analysis on the set of traits that were available for at least 25% of the species. As some trait values were still missing and we did not want to exclude less represented species, we performed data imputation based on correlations among traits and between traits and phylogeny, using an approach analogous to Dyderski and Jagodziński (2021).
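The 25% completeness threshold described above can be illustrated with a minimal Python sketch. The trait names and values are invented for illustration and do not come from the study's database, and `completeness` and `traits_to_keep` are hypothetical helpers rather than functions from any package used in the study (the original analyses were run in R).

```python
def completeness(values):
    """Share of species with a measured (non-missing) value for one trait."""
    return sum(v is not None for v in values) / len(values)


def traits_to_keep(trait_table, threshold=0.25):
    """Names of traits measured for at least `threshold` of the species."""
    return [name for name, values in trait_table.items()
            if completeness(values) >= threshold]


# Hypothetical toy data: one list per trait, one entry per species,
# with None marking a missing measurement.
toy_traits = {
    "height": [12.0, 3.5, None, 20.1, 0.4, 8.0, None, 15.2],            # 6 of 8 measured
    "leaf_nitrogen": [None, 21.3, None, None, None, None, None, None],  # 1 of 8 measured
}

kept = traits_to_keep(toy_traits)
print(kept)
```

In this toy table, only the trait measured for at least a quarter of the species is retained for imputation and modeling.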
As previously stated, we only imputed data for the traits that had at least 25% completeness with measured values obtained from databases and other sources used (CABI, 2020; Kaźmierczakowa et al., 2014, 2016; Rutkowski, 2006; Wild et al., 2019; Zieliński, 1987, 2004). We imputed missing values using the random forest method from the missForest package (Stekhoven & Bühlmann, 2012), as recommended by Penone et al. (2014). We used the PVR package (Santos, 2018) to increase the predictive potential of imputation models using phylogenetic eigenvectors (Diniz-Filho et al., 1998). Following the suggestion of Penone et al. (2014), we used the first 15 phylogenetic eigenvectors, which covered 63.9% of phylogenetic distance variation. The normalized root mean squared error of imputed traits was 0.4255. The phylogenetic tree was obtained from the megatree by Jin and Qian (2019) included in the V.PhyloMaker package.
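To make the reported error measure easier to interpret, the sketch below computes a normalized root mean squared error for continuous values in Python. It assumes the common definition in which the mean squared error is divided by the variance of the true values before taking the square root; whether the reported value of 0.4255 was obtained with exactly this normalization (for example, population versus sample variance, or an out-of-bag estimate) is an assumption here. The function name `nrmse` and the example numbers are hypothetical.

```python
import math
import statistics


def nrmse(true_values, imputed_values):
    """Root of (mean squared error / variance of the true values).

    Assumption: population variance is used for the normalization.
    """
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    mse = statistics.fmean(squared_errors)
    return math.sqrt(mse / statistics.pvariance(true_values))


# Hypothetical trait values for four species.
true_vals = [1.0, 2.0, 3.0, 4.0]

perfect = nrmse(true_vals, true_vals)      # imputation reproduces every value
mean_only = nrmse(true_vals, [2.5] * 4)    # every value replaced by the trait mean
print(perfect, mean_only)
```

Under this definition, 0 means perfect imputation and a value near 1 means the imputation is no better than replacing every value with the trait mean.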

| Data analysis
We performed all the analyses using the R software (R Core Team, 2020). We compared the performance of 13 models (Table 3) including different sets of functional traits according to the same methodological approach. We chose traits for the models named LHS, LES, Reich, Díaz, CSR, LeafStoich, and Morpho based on previous findings (Table 3). We developed models FlowerFruit, Seed, FlowFruitSeed, Leaf, and Stem by focusing on a given plant organ (or a certain function, in the case of the model focused on reproduction), to check their explanatory potential and to compare the predictive power of traits representing different organs (Kleyer & Minden, 2015). In the last model, we included all traits that we collected in this study. Regarding the selection of the models used in our study, we acknowledge that the most methodologically justified approach would have been to conduct a meta-analysis of all functional trait-related studies regarding species conservation and then select the most suitable trait sets for our study. However, due to the limited availability of trait measurements for the species we studied, we had to focus on the trait sets that used the available traits.
The trait selection process was not conducted systematically, and

TABLE 3 Overview of the models studied.

(Breiman, 2001). We chose the random forest algorithm, which is based on multiple decision or classification trees and therefore leads to better stability and accuracy of the model obtained (Breiman, 2001). Random forest has been successfully applied for assessing the conservation status of habitats (Reynolds et al., 2016) or for predicting functional traits of trees (Dyderski & Jagodziński, 2019 and upscales the observations from the underrepresented class (threatened species). To increase the robustness of the models, we ran SMOTE with repeated cross-validation subsampling.
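The upsampling of the underrepresented class can be sketched as follows. This is a simplified Python illustration of the general SMOTE idea (interpolating between a minority-class observation and one of its nearest minority-class neighbors), not the implementation used in the study. The function `smote_sketch`, its parameters, and the two-trait toy data are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class
    points and their k nearest minority-class neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # relative position on the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Hypothetical, already scaled values of two traits for four threatened species.
threatened = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_sketch(threatened, n_new=6, k=2, seed=1)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real minority-class observations, the upsampled class stays within the trait range of the observed threatened species. In practice, such resampling is applied only to the training folds of a cross-validation, so that the evaluation data contain no synthetic observations.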

| Model evaluation
To assess accuracy of the models, we used area under the receiver-

| Are the traits of different plant organs similarly available?
Categorical traits had the highest completeness ( (Table 1) were as follows: height (65.9%), onset of flowering (65.4%), and seed mass (64.3%). The mean completeness of leaf-related, reproduction-related, and stem-related numerical traits was 42%, 52%, and 45%, respectively. As the completeness of the categorical traits was almost the same for all the traits and exceeded 97%, we did not include them in this comparison of completeness among organs. Leaf-related traits were the most numerous (11 of 23 traits) and were also the most often used in the models, as they occurred in nine of 13 models (Table 3). There were eight reproduction-related traits (combining traits of flowers, fruits, and seeds) and two stem-related traits, which occurred in seven and four of 13 models, respectively.

| Which trait sets have the highest explanatory potential?
The models with the highest AUC (Table 4)

| Does model accuracy increase with the number of traits used?
The highest AUC was achieved by model Díaz, which included only six traits (Table 4). However, model All, which included all 23 traits, provided the second highest AUC value. The models with the next highest AUC scores were CSR and Morpho, which included six and 11 traits, respectively. Model CSR, however, had lower values for the remaining parameters than the other three best models.
On the contrary, the rest of the models that included more than the median number of all traits (i.e., with more than six traits used) did not have equally satisfactory performance. Although

| DISCUSSION
Our findings suggest that what matters most for a trait set of high explanatory power is not the number of traits included, but rather a precise selection of those traits, which is in line with the findings of Lefcheck et al. (2015), who similarly stated that more is not always better. We found no evidence that including more traits in a model leads to higher model accuracy. Also, the models that include traits representing all plant organs do not necessarily outrank models that include, for example, only leaf-related traits, in terms of their AUC score. Differences among models can result from the correlated structure of functional traits (Díaz et al., 2016; Kleyer et al., 2019; Wright et al., 2004). That way, including one or more aspects (traits) of a particular life strategy in the model can split the variability into more specific responses, or account for general trade-offs of resource utilization strategies. For example, leaf nitrogen and SLA are strongly correlated, as they reflect higher investment into leaf acquisition than leaf persistence (Díaz et al., 2016; Wright et al., 2004). Therefore, including both of them can provide deeper insight into both mechanisms (structural defense connected with leaf mechanical structure versus nitrogen utilization for photosynthesis efficiency), as in model Díaz. However, interactions between them can affect response curves. In contrast, accounting for only one of these traits can provide a clearer response curve, but does not account for such complexity (Figure 1. "Form traits"). Therefore, the Morpho, and All), we can suggest that using carefully selected traits covering more plant organs is highly beneficial for model performance (Kleyer & Minden, 2015; Lefcheck et al., 2015). In the case of our study, model Díaz, which reached the highest AUC, included traits with a mean completeness of 51%, whereas the mean completeness of all numerical traits studied was 45%.
This completeness, higher by 6 percentage points, can be related to the proven explanatory potential of the traits included in model Díaz (Díaz et al., 2016). At the same time, it may arise because the "popular" traits are measured more often and are consequently better covered in databases (Rosado et al., 2013).
Thus, some traits of high explanatory potential may be overlooked. Taking into consideration that those traits have been carefully selected based on 148 studies from all over the world, their potential in research on coping mechanisms of plants is highly significant.
FIGURE 1 Partial dependence plots for the six most important traits of the four best models (from the top left corner: model Díaz, All, CSR, and Morpho). A higher average prediction indicates a higher probability that a given species is threatened, according to a given model. We reduced the number of traits presented to only the six most important ones to maintain clarity.
However, the ongoing, problematic issue of insufficient coverage of species with trait measurements hinders their immediate use in ecological studies. We come to a point where the traits of highest potential have been identified, but due to their low availability, we probably have to follow two paths simultaneously: work harder toward filling as many gaps in the trait databases as possible; and use trait data that are currently available in the best possible way.
Although our study provides a valuable quantification of frequently used model outputs, it still has some considerable drawbacks. First, we tested data collected only for one country and only for woody plants. Omitting the herbaceous species could impact the results (Gilliam & Roberts, 2003 & Gaston, 2006). Therefore, due to their specificity, they are often not universal. We, on the contrary, aimed to focus on sets of functional traits that included less specific traits with significant explanatory potential for a wide range of research questions.

| CONCLUSIONS
Here, we contribute to the development of guidelines on assembling the optimal trait set for a given study. Primarily, as models based on trait sets proposed in previous studies performed relatively better in our study (Díaz et al., 2016; Pierce et al., 2013; Westoby, 1998), we recommend selecting potentially beneficial traits based on previous findings. Traits representing different plant organs should be included in the models whenever possible (Kleyer & Minden, 2015; Lefcheck et al., 2015), as three of the four best models from our comparison included traits of all aboveground organs. Using, for example, only leaf traits can also provide relatively good results, as the third best model used only leaf traits (Pierce et al., 2013). However, the high explanatory power of leaf traits can derive from the high completeness of those traits: We should focus on improving the coverage of trait databases with new data, because this effort may enable the use of further, possibly powerful traits (Cornwell et al., 2019). Including numerous traits can also improve model output, but the completeness of the traits should be as high as possible to capture their interspecific variability (Paź-Dyderska et al., 2020). For studies where the research questions are not highly specific, we recommend using the trait set from model Díaz (i.e., SLA, height, seed mass, leaf area, leaf nitrogen, and specific stem density), as it showed the best results using only six traits (Díaz et al., 2016), which we believe is a favorable trade-off for use in further ecological modeling research.

ACKNOWLEDGMENTS
We would like to thank Dr. Marcin K. Dyderski for critical comments on techniques of data analyses implemented in the study. We are grateful to Dr. Lee E. Frelich (Department of Forest Resources, University of Minnesota, USA) for linguistic revision of the manuscript. We are thankful to Dr. Julia Kemppinen and one anonymous reviewer for their insightful suggestions that helped us to significantly improve the manuscript.

FUNDING INFORMATION
The study was supported by the Institute of Dendrology, Polish Academy of Sciences, Kórnik, Poland.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

DATA AVAILABILITY STATEMENT
Datasets will be publicly available when the article is published.